ABSTRACT


Rainfall data are among the most significant values in hydrological and climatological
modelling. However, the datasets are prone to missing values for various
reasons. This study aims to impute missing rainfall values using
several imputation methods: Replace by Mean, Nearest Neighbor,
Random Forest (RF), Nonlinear Iterative Partial Least Squares (NIPALS)
and Markov Chain Monte Carlo (MCMC). Daily rainfall datasets
from 48 rainfall stations across east-coast Peninsular Malaysia were used in
this study. The imputed datasets were then fed into a Multiple Linear Regression (MLR)
model. The performance of these methods was evaluated using
Root Mean Square Error (RMSE), Mean Absolute Error (MAE)
and the Nash-Sutcliffe Efficiency Coefficient (CE). The experimental results
showed that RF coupled with MLR (RF-MLR) was the most suitable
approach for filling in the missing data in east-coast
Peninsular Malaysia.
1. INTRODUCTION


In climatological and hydrological modeling, daily rainfall is among the most significant variables.
Water resources management requires comprehensive datasets of hydrological variables, including volumes,
temperature and water level. Nevertheless, hydrologists commonly encounter the challenge
of missing data in hydrological datasets. Missing data occur for various reasons, such
as relocation of rainfall stations, environmental changes, malfunctioning instruments and reorganization
of the network [1]. In hydrology, three types of missing data are taken into account: Missing
Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR).
Missing values in rainfall datasets are classified as MCAR, since
the data in that area or any other area do not affect the occurrence of missing values in the rainfall
dataset of an area [1, 2]. It was reported by [3] that missing values in univariate hydrological time
series were likewise classified as MCAR and MAR. MCAR concerns data where the chance that a particular
value is missing is independent of any variable in the dataset [2]. The most convenient practice for
handling missing data is to delete the entire observations containing missing values and analyze the
retained complete data. However, ignoring or removing the data can be unsuitable, as it can make the
data discontinuous, leading to information loss and, consequently, to unfitting conclusions.
The consistency and continuity of rainfall data are crucial in statistical analyses such as time series
analysis [4]. Continuity and consistency may also be unstable because of modifications to the
observational procedure and inadequate records [2]. Therefore, filling the gaps in daily rainfall data is critical.

In hydrologic modeling, using the most efficient method to obtain precise estimates of rainfall
is highly important. The most fitting estimate should retain the key features of the dataset and respect
the rainfall characteristics of the specific location [1, 5]. Hence, to achieve the best results in data
analyses, the data must be complete and of good quality before proceeding to modeling.

Generally, a few methods are used to manage missing data. The Normal Ratio method, initially
recommended by [7] and later modified by [8], has become a generic method for estimating missing
rainfall data [6, 9]. This method was employed if any adjacent gauge had a normal annual precipitation
differing by more than 10% from that of the gauge under consideration [10]. It was
based on previous observations of the rain gauge and its surroundings. Nonetheless, other significant
factors, including the distances among rain gauges and the areal coverage of each gauge, were considered
when using this method and had become evident to significantly influence the rainfall estimate. Unlike
the common methods listed, the NIPALS and RF algorithms are capable of preserving high-dimensional
data and constructing efficient complete rainfall datasets. Furthermore, a desirable
property of these algorithms is their capability of handling diverse types of missing data. They
could potentially scale to big data settings and adapt to nonlinearity and interactions [11].

In contrast, alternative techniques used other factors to estimate missing
rainfall data. A number of researchers have presented ways to manage missing data in
their studies, specifically in the field of hydrology. One such method is the inverse distance method, which
uses the distances from the target station to two to five neighboring stations, giving more weight to the data
from the nearest weather station [12]. The regression method is also a well-known method
for estimating missing values in hydrological data [13]. A distinctive feature of the regression model is
that it also considers other factors, such as elevation and topography, when imputing missing
hydrological data [14]. The regression method employs step-wise regression to determine the coefficients
for all the significant neighboring stations [15]. However, regression-based methods underestimated the
number of no-rainfall days and overestimated the total number of rainy days. Additionally,
the probability distribution of rainfall was not well preserved [16]. The regression method might also
misrepresent the number of degrees of freedom and was challenging to apply to noisy datasets [17].

The objectives of this study are twofold: first, to perform data imputation using the Replace by Mean,
Nearest Neighbor, Markov Chain Monte Carlo (MCMC), Nonlinear Iterative Partial Least Squares (NIPALS)
and Random Forest (RF) methods for daily rainfall data in the east coast of Peninsular Malaysia; second,
to evaluate the performance of the imputation methods coupled with a Multiple Linear Regression (MLR) model
in predicting future daily rainfall values. The findings of this study are expected to contribute towards
identifying the best data imputation technique for reconstructing complete
rainfall datasets.


2. STUDY AREA AND DATA 

This study centers on the east coast of Peninsular Malaysia, which lies between latitudes 3.5° N
and 6.5° N and longitudes 102° E and 104° E. The datasets used in this study are high-dimensional data
obtained from daily rainfall records for 1987-2018 (32 years) from the Department of Irrigation and
Drainage Malaysia (DID), as represented in Figure 1. The 48181 data values contained 8.59% missing values.
According to [1], datasets that contain less than 10% missing values are regarded as excellent
data. A large number of time series observations is needed to obtain a precise outline of rainfall patterns
[18]. In addition, the reliability of frequency estimators for a long time series is highly valuable, since
it is strongly associated with the sample size in data analysis. Table 1 shows the geographical coordinates
of the 48 rainfall stations chosen from the east coast of Peninsular Malaysia.

Figure 1. The location of 48 rainfall stations in east-coast Peninsular Malaysia


Table 1. Geographical coordinates and percentage of missing values for east-coast of Peninsular Malaysia


Stations                      Code   Stations                       Code
Setor JPS KT                  E1     Stor JPS Raub                  E25
Kg. Sg. Tong                  E2     Pejabat JPS Pahang             E26
Kg. Dura                      E3     Rumah Pam Paya Kangsar         E27
Kg. Menerong                  E4     JKR Benta                      E28
Kg. Embong Sekayu             E5     Kg. Sg. Yap                    E29
Jambatan Jerangau             E6     Kawasan B Ulu Tekai            E30
SM Sultan Omar                E7     Kg. Merting                    E31
Al-Muktafi                    E8     Bkt. Betong                    E32
Rumah Pam Paya Kempian        E9     Ulu Tekai (A)                  E33
Sg. Gawi                      E10    Kuala Tahan                    E34
Jambatan Tebak                E11    Gunong Brinchang               E35
Kg. Ban Ho                    E12    Kg. Laloh                      E36
Hulu Jabor                    E13    Ulu Sekor                      E37
Kg. Batu Hampar               E14    Dabong                         E38
Klinik Chalok Barat           E15    Gob                            E39
Inst. Pertanian Besut         E16    Balai Polis Bertam             E40
Sg. Kepasing                  E17    Sek. Men. Teknik Kuala Krai    E41
Temeris                       E18    Air Lanas                      E42
Sg. Cabang Kanan              E19    Kg. Durian Daun                E43
Kg. Unchang                   E20    Bendang Nyior                  E44
Kg. Batu Gong                 E21    Rumah Kastam                   E45
Kuala Marong                  E22    Blau                           E46
Rumah Pam Pahang Tua          E23    Gunung Gagau                   E47
Pintu Kawalan Pulau Kertam    E24    Brook                          E48


3. RESEARCH METHOD
3.1. Replace by mean 

The easiest imputation technique when less than 10% of the data are missing is the mean substitution
method [19]. It consists of replacing every missing value in the series $X^{(k)}$, $k = 1, \ldots, d$, by the
mean of the corresponding component. This method has been employed in multivariate hydrological frequency
analysis studies [20]. It was concluded that replacing missing data with the overall average of all
observations provided excellent results [15]. The formula can be written as:


$$\bar{P} = \frac{1}{n} \sum_{i=1}^{n} P_i \qquad (1)$$

where $P_i$ is the observed rainfall data and $n$ is the number of rainfall days.

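
As a minimal sketch of Equation (1), the snippet below fills every gap in one station's series with
the mean of its observed values. The function name `impute_mean`, the use of NaN to mark a missing
day and the sample values are illustrative assumptions, not part of the study.

```python
import numpy as np

def impute_mean(series):
    """Replace NaN entries with the mean of the observed entries (Equation 1)."""
    values = np.asarray(series, dtype=float)
    observed = ~np.isnan(values)
    if not observed.any():
        raise ValueError("series has no observed values")
    filled = values.copy()
    # n in Equation (1) is taken here as the number of observed days.
    filled[~observed] = values[observed].mean()
    return filled

# Hypothetical daily rainfall (mm) with two missing days.
rain = [0.0, 12.0, np.nan, 3.0, np.nan, 5.0]
rain_filled = impute_mean(rain)
print(rain_filled)
```

Because every gap receives the same value, the filled series has a smaller variance than the observed
one, which is the side effect discussed in the conclusion.
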
3.2. Nearest neighbor

Nearest neighbor imputation algorithms are another effective way to fill in missing data.
Every missing value in a record is substituted by a value obtained from related cases in
the whole set of records [21]. The method is based on the k observed values of the most similar time series.
These values are then combined into a single value using approaches such as a kernel function or
averaging [22]. Nearest neighbor imputation approaches are donor-based methods: the imputed value is
either a value that was actually measured for another record in the dataset or the average of the measured
values from k records [23]. The process of imputation by nearest neighbor can be briefly explained as
follows. Let n observations on p covariates be gathered. The corresponding n × p data matrix is given by
$X = (x_{is})$, where $x_{is}$ denotes the ith observation of the sth variable. Let $O = (o_{is})$ denote
the corresponding n × p matrix of dummies with entries:


$$o_{is} = \begin{cases} 1 & \text{if } x_{is} \text{ was observed} \\ 0 & \text{for a missing value} \end{cases} \qquad (2)$$

Distances between two observations $x_i$ and $x_j$, which are represented by rows in the data matrix, can be
calculated using the $L_q$-metric on the observed data. One then uses the distances:

$$d_q(x_i, x_j) = \left[ \frac{1}{m_{ij}} \sum_{s=1}^{p} |x_{is} - x_{js}|^q \, I(o_{is} = 1) \, I(o_{js} = 1) \right]^{1/q} \qquad (3)$$

where $m_{ij} = \sum_{s=1}^{p} I(o_{is} = 1) \, I(o_{js} = 1)$ denotes the number of valid components in the
computation of distances. These distances determine the nearest neighbors that were used [24].
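
The distance in Equation (3) and the donor-based averaging described above can be sketched as follows.
This is an illustration under stated assumptions rather than the implementation used in the study:
missing cells are marked with NaN, the k donors are combined by a plain average (one of the two options
mentioned in the text), and the function names and the small data matrix are hypothetical.

```python
import numpy as np

def lq_distance(xi, xj, q=2):
    """Equation (3): L_q distance over the components observed in both rows."""
    both = ~np.isnan(xi) & ~np.isnan(xj)
    m_ij = both.sum()  # number of valid components
    if m_ij == 0:
        return np.inf  # rows share no observed component
    return (np.sum(np.abs(xi[both] - xj[both]) ** q) / m_ij) ** (1.0 / q)

def knn_impute(X, k=2, q=2):
    """Fill each missing cell with the average of its k nearest donor rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n = X.shape[0]
    for i in range(n):
        for s in np.flatnonzero(np.isnan(X[i])):
            # Donors are the other rows in which variable s was observed.
            donors = [j for j in range(n) if j != i and not np.isnan(X[j, s])]
            donors.sort(key=lambda j: lq_distance(X[i], X[j], q))
            nearest = donors[:k]
            if nearest:
                filled[i, s] = np.mean([X[j, s] for j in nearest])
    return filled

# Hypothetical matrix: rows are days, columns are neighboring stations.
days = np.array([[1.0, 2.0, 3.0],
                 [1.0, 2.0, np.nan],
                 [9.0, 9.0, 9.0],
                 [1.5, 2.5, 4.0]])
days_filled = knn_impute(days, k=2)
print(days_filled[1, 2])
```

In this example the two donors closest to the second row are the first and fourth rows, so the gap is
filled with the average of 3.0 and 4.0.
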


3.3. Markov chain Monte Carlo (MCMC)

Even with other imputation methods, some quantities still cannot be calculated
explicitly because of missing data or complex dependence. In such cases, MCMC can be used, with the
imputation performed by Monte Carlo simulation. According to [2],
expectation-maximization (EM) is a technique that computes the maximum likelihood estimates used by the
MCMC method to replace missing data. The MCMC method was used for the multiple imputation procedure under
the assumption of multivariate normality. The MCMC method is based on Bayesian inference with missing data
and follows these steps [25]:

- Imputation step (I-step). Estimate the mean vector and covariance matrix, then simulate the missing
  values for each observation.
- Posterior step (P-step). Simulate the mean vector and covariance matrix from the data completed in the
  imputation step.
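
The alternation of the two steps can be sketched as below. The text does not state the prior or the
starting values, so this sketch assumes a noninformative prior (which gives an inverse-Wishart posterior
for the covariance matrix) and starts from column means instead of EM estimates; the function name
`mcmc_impute` and the synthetic data are hypothetical. It returns the last draw of the chain, whereas
multiple imputation would keep several well-separated draws.

```python
import numpy as np

def mcmc_impute(X, n_iter=100, seed=0):
    """Data augmentation under multivariate normality: alternate I-step and P-step."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    missing = np.isnan(X)
    # Assumed starting point: column means, then sample mean and covariance.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    mu = filled.mean(axis=0)
    sigma = np.cov(filled, rowvar=False)
    ridge = 1e-8 * np.eye(p)  # small ridge for numerical stability
    for _ in range(n_iter):
        # I-step: draw each row's missing part given its observed part.
        for i in np.flatnonzero(missing.any(axis=1)):
            m, o = missing[i], ~missing[i]
            cond_mean, cond_cov = mu[m], sigma[np.ix_(m, m)]
            if o.any():
                coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                cond_mean = cond_mean + coef @ (filled[i, o] - mu[o])
                cond_cov = cond_cov - coef @ sigma[np.ix_(o, m)]
            cond_cov = (cond_cov + cond_cov.T) / 2  # keep it symmetric
            filled[i, m] = rng.multivariate_normal(cond_mean, cond_cov)
        # P-step: draw the covariance matrix, then the mean, from the completed data.
        xbar = filled.mean(axis=0)
        centred = filled - xbar
        scatter = centred.T @ centred + ridge
        # Inverse-Wishart draw obtained by inverting a Wishart draw (needs n - 1 >= p).
        z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(scatter), size=n - 1)
        sigma = np.linalg.inv(z.T @ z)
        sigma = (sigma + sigma.T) / 2
        mu = rng.multivariate_normal(xbar, sigma / n)
    return filled

# Synthetic, normally distributed stand-in for three correlated stations.
rng = np.random.default_rng(1)
cov = [[4.0, 3.0, 2.0], [3.0, 4.0, 3.0], [2.0, 3.0, 4.0]]
data_missing = rng.multivariate_normal([5.0, 6.0, 7.0], cov, size=60)
data_missing[::7, 1] = np.nan
data_missing[3::11, 2] = np.nan
completed = mcmc_impute(data_missing, n_iter=50)
```

Daily rainfall is skewed and bounded at zero, so in practice a transformation may be needed before the
normality assumption is reasonable; the sketch leaves this out.
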


3.4. Nonlinear iterative partial least squares (NIPALS)

The NIPALS method allows principal component analysis (PCA) to be carried out on data with missing
values. It iteratively applies PCA to the dataset with missing values; the NIPALS algorithm is run
on the dataset and the resulting PCA model is used to predict the missing values [26]. The
NIPALS algorithm works as follows:

Given a rectangular data table of size n × p, let $X = \{x_{ij}\}$, $1 \le i \le n$, $1 \le j \le p$, denote
the matrix representing the observed values of the variables $x_{\cdot j}$ for n statistical units. Next, if X
is of rank a, then the decomposition formula for the principal component analysis of X is
$X = \sum_{h=1}^{a} t_h p_h'$, where $t_h = (t_{h1}, \ldots, t_{hi}, \ldots, t_{hn})'$ and
$p_h = (p_{h1}, \ldots, p_{hj}, \ldots, p_{hp})'$ are the principal factors and principal components,
respectively. Hence, the NIPALS algorithm estimates a missing value corresponding to the cell (i, j) as:


$$\hat{x}_{ij} = \sum_{h=1}^{k} t_{hi} \, p_{hj} \qquad (4)$$


where k (k < a) is established by cross-validation. Execution of the NIPALS algorithm is straightforward,
since it is based on simple linear regressions.
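
A compact sketch of the idea behind Equation (4) is given below: each score vector and loading vector is
obtained from simple regressions that skip the missing cells, and the missing cells are then filled from
the first k components. The function name `nipals_impute`, the optional centering and the rank-one test
matrix are assumptions made for illustration; k is passed in directly rather than chosen by
cross-validation.

```python
import numpy as np

def nipals_impute(X, n_components=1, center=True, n_iter=500, tol=1e-10):
    """Fill NaN cells with sum_h t_hi * p_hj from NIPALS components (Equation 4)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    offset = np.nanmean(X, axis=0) if center else np.zeros(X.shape[1])
    # Missing cells are set to 0 and masked out of every regression below.
    residual = np.where(observed, X - offset, 0.0)
    approx = np.zeros_like(residual)
    for _ in range(n_components):
        # Start from the column with the largest observed sum of squares.
        t = residual[:, np.argmax((residual ** 2).sum(axis=0))].copy()
        for _ in range(n_iter):
            # Regress each column on t, using only rows observed in that column.
            p = (residual * t[:, None]).sum(axis=0) / (observed * t[:, None] ** 2).sum(axis=0)
            p /= np.linalg.norm(p)
            # Regress each row on p, using only columns observed in that row.
            t_new = (residual * p).sum(axis=1) / (observed * p ** 2).sum(axis=1)
            done = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if done:
                break
        approx += np.outer(t, p)
        residual = np.where(observed, residual - np.outer(t, p), 0.0)
    filled = X.copy()
    filled[~observed] = (approx + offset)[~observed]
    return filled

# Hypothetical rank-one table: one component reproduces it exactly.
table = np.outer([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 3.0])
table[1, 2] = np.nan  # true value 6
table[3, 0] = np.nan  # true value 8
completed_nipals = nipals_impute(table, n_components=1, center=False)
print(completed_nipals)
```

The sketch assumes that every row and every column keeps at least one observed value; otherwise the
masked regressions divide by zero.
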


3.5. Random forest 

Random Forest (RF) can manage mixed data types. It is known to work efficiently
under difficult conditions such as non-linear data structures, complex interactions and high dimensions.
According to [27], random forest is capable of dealing with mixed data types and, as a non-parametric method,
allows for non-linear (regression) and interaction effects. Let us assume $X = (X_1, X_2, \ldots, X_p)$
to be an n × p-dimensional data matrix. For an arbitrary variable $X_s$ including missing values at entries
$i_{mis}^{(s)} \subseteq \{1, \ldots, n\}$, the rainfall dataset in this study can be separated into two categories:
- The observed values of variable $X_s$, denoted by $y_{obs}^{(s)}$.
- The missing values of variable $X_s$, denoted by $y_{mis}^{(s)}$.

To start, an initial guess for the missing values in X is made using the mean or another imputation
method. Next, the variables $X_s$, $s = 1, \ldots, p$, are sorted by their number of missing values, beginning
with the smallest. For each variable $X_s$, the missing values are imputed by first fitting a Random Forest
with response $y_{obs}^{(s)}$ and predictors $x_{obs}^{(s)}$ and next predicting the missing values
$y_{mis}^{(s)}$ by applying the trained Random Forest to $x_{mis}^{(s)}$, where $x_{obs}^{(s)}$ and
$x_{mis}^{(s)}$ denote the values of the other variables at the entries where $X_s$ is observed and missing,
respectively. The imputation procedure is repeated until a stopping criterion is satisfied.

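
The procedure above can be sketched with scikit-learn's `RandomForestRegressor`. The text does not name
the stopping criterion, so the sketch assumes the common rule of stopping when the change between
successive imputed matrices increases for the first time; the function name `rf_impute`, the number of
trees and the synthetic station data are also assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, max_iter=5, n_trees=50, seed=0):
    """Iteratively impute each variable with a Random Forest fitted on the others."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    # Visit variables in order of increasing number of missing values.
    order = [s for s in np.argsort(missing.sum(axis=0)) if missing[:, s].any()]
    previous_change = np.inf
    for _ in range(max_iter):
        before = filled.copy()
        for s in order:
            mis, obs = missing[:, s], ~missing[:, s]
            others = np.delete(filled, s, axis=1)
            forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
            forest.fit(others[obs], filled[obs, s])       # response: observed values of X_s
            filled[mis, s] = forest.predict(others[mis])  # predict the missing values of X_s
        change = np.sum((filled - before) ** 2) / np.sum(filled ** 2)
        if change > previous_change:
            return before  # assumed rule: keep the result from before the increase
        previous_change = change
    return filled

# Synthetic stand-in for three correlated stations with gaps.
rng = np.random.default_rng(0)
base = rng.gamma(2.0, 5.0, size=40)
gappy = np.column_stack([base,
                         0.8 * base + rng.normal(0.0, 1.0, 40),
                         1.2 * base + rng.normal(0.0, 1.0, 40)])
gappy[::8, 0] = np.nan
gappy[3::10, 2] = np.nan
completed_rf = rf_impute(gappy, max_iter=3, n_trees=20)
```

The sketch assumes that no variable is missing for every observation, since a forest cannot be fitted
without observed responses.
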
3.6. Multiple linear regression 

After all missing values have been replaced by the various approaches, the complete datasets are analyzed
using multiple linear regression to identify the best approach to handling missing data in east-coast
Peninsular Malaysia. Regression analysis is a statistical method that uses the relationship between two or
more quantitative variables to predict a response variable [28]. The MLR model is a common statistical
method employed in many disciplines, including the analysis of climate data [29]. The multiple linear
regression model can be stated as follows:


$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i, \quad i = 1, \ldots, N \qquad (5)$$


where $Y_i$ is the value of the response variable, $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are unknown
constants, $X_{ij}$ is the value of the jth predictor variable and $\varepsilon_i$ is the random error.
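
Equation (5) can be fitted by ordinary least squares, as in the sketch below. The text does not state
which predictors were used, so the predictor matrix here is purely hypothetical, as are the function
names `fit_mlr` and `predict_mlr`.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares estimates (beta_0, beta_1, ..., beta_p) for Equation (5)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # leading column of 1s for beta_0
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict_mlr(beta, X):
    """Predicted response for new predictor rows."""
    return beta[0] + np.asarray(X, dtype=float) @ beta[1:]

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2.
predictors = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
response = [1, 3, -2, 0, 2]
beta = fit_mlr(predictors, response)
print(beta)
```

With noise-free data the fit recovers the generating coefficients; with imputed rainfall data the
residuals feed the error measures of the next subsection.
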


3.7. Root mean square error (RMSE), mean absolute error (MAE) and Nash-Sutcliffe efficiency
coefficient (CE)

The root mean square error (RMSE) has been used as a standard statistical metric to measure model
performance in meteorology, air quality and climate research studies. RMSE provides information
on short-term performance, as it measures the difference between the predicted and the observed
values. The lower the RMSE, the more accurate the estimate. Another statistical approach for model
performance evaluation is the Nash-Sutcliffe efficiency coefficient (CE). The CE is a popular index for
assessing the predictive power of hydrological models [30]. A CE value of 1 indicates a perfect fit, so the
best-performing models have CE values closest to 1.
The mean absolute error (MAE) is another useful measure widely used in model evaluations. MAE
is an indication of the average deviation of the predicted values from the corresponding
observed values and can provide information on the long-term performance of the models; a lower value
of MAE represents better long-term performance [1]. The RMSE, CE and MAE are given
by the following formulas:


$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \qquad (6)$$

$$\mathrm{CE} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (7)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i| \qquad (8)$$


where $y_i$ is the observed rainfall, $\hat{y}_i$ is the predicted rainfall, and $\bar{y}$ is the average
rainfall over the rainfall stations in east-coast Peninsular Malaysia.
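
The three measures can be computed as in the following sketch, which uses the usual definitions of
RMSE, CE and MAE with the mean taken over the observed values; the function names and the four sample
values are illustrative assumptions.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean square error, Equation (6)."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency coefficient, Equation (7)."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    residual = np.sum((observed - predicted) ** 2)
    spread = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - residual / spread)

def mae(observed, predicted):
    """Mean absolute error, Equation (8)."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(predicted - observed)))

# Hypothetical observed and predicted rainfall (mm).
obs = [2.0, 4.0, 6.0, 8.0]
pred = [3.0, 4.0, 5.0, 8.0]
print(rmse(obs, pred), nash_sutcliffe(obs, pred), mae(obs, pred))
```

For these values the squared errors sum to 2 and the observed spread sums to 20, giving an RMSE of
about 0.707, a CE of 0.9 and an MAE of 0.5.
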


4. RESULTS AND DISCUSSION 

This section discusses the results of the imputation methods for daily rainfall datasets. The imputation
methods were applied to 48 stations in the east coast of Peninsular Malaysia. The experiments were conducted
for each station using all five imputation methods, and the results were then averaged for each
imputation method. Root Mean Square Error (RMSE) and the Nash-Sutcliffe Efficiency
Coefficient (CE) were used to evaluate the performance of each method. The smaller the discrepancy between
the estimated and observed values at a station, the smaller the RMSE.
Meanwhile, CE values may vary from −∞ to 1 and are deemed satisfactory when they are higher than 0.5.
The method with the smallest RMSE and the highest CE was selected as the best technique for filling
the missing data in daily rainfall datasets. The experimental results for each imputation method are shown
in Table 2.

Table 2 presents the average RMSE and CE values for the five methods. The smallest RMSE and
the highest CE were obtained from the Replace by Mean method. Nevertheless, the CE values
showed that all imputation methods generated satisfactory results, since the obtained values were
close to 1. In view of these results, Replace by Mean provided the best performance.
Meanwhile, Random Forest (RF) was the worst imputation method for daily rainfall data in east-coast
Peninsular Malaysia, as it had the lowest CE and the highest RMSE among the methods.


Table 2. Average RMSE and CE values for five imputation methods
Method              RMSE       CE
Replace by Mean     2.3100*    0.9873*
Nearest Neighbor    4.2362     0.9597
MCMC                5.0208     0.9461
NIPALS              3.8386     0.9703
Random Forest       6.3337     0.9150
* indicates the best results


Once the missing values had been filled in, the next step in this study was to analyze the full dataset
using the Multiple Linear Regression (MLR) model. The MLR model was used to identify the best approach
to handling missing data when the imputed values are coupled with modelling. MAE and RMSE were used
to evaluate the performance of the imputation methods coupled with the MLR model.

Table 3 presents the RMSE and MAE values for each statistical approach to imputing the missing
values of daily rainfall data in east-coast Peninsular Malaysia coupled with the multiple linear regression
model. It can be observed that RF-MLR has the lowest RMSE and MAE, 16.4428 and 8.6229 respectively,
compared to the other approaches. Thus, the final results suggest that Random Forest is the best statistical
approach for imputing the missing values of daily rainfall data when it is coupled with a regression model.
Table 3 also shows that Replace by Mean-MLR has the second-smallest values
for RMSE and MAE, making the model competitive with RF-MLR.


Table 3. The results for MLR coupled with imputation methods


Model                  RMSE       MAE
Replace by Mean-MLR    16.9564    8.9195
NN-MLR                 17.4154    9.2199
MCMC-MLR               17.2803    9.3224
NIPALS-MLR             17.2179    9.0531
RF-MLR                 16.4428    8.6229


Finally, the observed and predicted values for the Replace by Mean-MLR, NN-MLR, MCMC-MLR,
NIPALS-MLR and RF-MLR models were plotted for visual inspection. Figure 2 shows the results of the
imputation methods for 175 missing daily rainfall data in east-coast Peninsular Malaysia at station Setor
JPS KT (E1). From Figure 2, it can be observed that the daily rainfall values imputed using
RF-MLR and NN-MLR showed similar trends. For instance, both models were responsive to rainfall
occurrences, with peaks of similar magnitude and timing. However, the RMSE and MAE for RF-MLR were
significantly lower than those for NN-MLR. In contrast, Replace by Mean-MLR generally underestimated
the rainfall, since the plotted line of estimated values tended to flatten out and did not seem to show
any trend. With these results, RF-MLR was considered the best model for filling the gaps in the missing
values of daily rainfall data in east-coast Peninsular Malaysia.

[Figure 2: line plot of daily rainfall; vertical axis: Rainfall (mm)]

Figure 2. Data imputation results of 175 missing rainfall data for the MCMC, NIPALS, nearest neighbour
and replace by mean models. Five line graphs represent the observation (blue), MCMC (red), NIPALS
(green), nearest neighbour (purple) and replace by mean (black)


5. CONCLUSION 

The search for the most efficient method for imputing missing values in rainfall data has
continuously received considerable attention in many studies. In this study, five imputation methods, namely
Replace by Mean, NN, MCMC, NIPALS and RF, were used and compared to identify the most appropriate
technique for filling the missing data in daily rainfall data in east-coast Peninsular Malaysia. The results
showed that Replace by Mean is the best method for single imputation. However, RF proved its
superiority as the method with the best result when coupled with MLR. The study also found that,
with Replace by Mean, the dataset is prone to changes in its standard deviation and
skewness. Furthermore, it is confirmed that the performance of predictive
modelling coupled with an imputation method may differ from that of single imputation alone. Therefore, in
finding the best method for data imputation, it is crucial to test the dataset after imputation with a
predictive model. To conclude, the use of various imputation techniques based on the characteristics of
rainfall is endorsed, and further studies with different methodologies and datasets should be explored.